Conversation
marianotepper
left a comment
There was a problem hiding this comment.
Looks good in general. Some minor comments here and there.
There is one section that may need an update following some recent changes by @jshook.
docs/benchmarking.md
Outdated
| - **HDF5**: The dataset consists of a single HDF5 file with three datasets labelled `train`, `test` and `neighbors`, representing the base vectors, query vectors and the ground truth. | ||
|
|
||
| General procedure for running benchmarks: | ||
| - Specify the dataset names to benchmark in `datasets.yml`. |
There was a problem hiding this comment.
This is unclear. Do we need to modify datasets.yml, or chose from it? If the latter, where do we specify the chosen ones?
There was a problem hiding this comment.
Added a link to a following section which describes how "specifying datasets" works, and also reworked the section to make it more clear. Let me know if you'd prefer I elaborate a bit here.
|
|
||
| ### Using Fvec/Ivec datasets | ||
|
|
||
| Using fvec/ivec datasets requires them to be configured in `MultiFileDatasource.java`. Some datasets are already pre-configured; these will be downloaded and used automatically on running the benchmark. |
| import io.github.jbellis.jvector.vector.types.VectorFloat; | ||
| import io.github.jbellis.jvector.vector.types.VectorTypeSupport; | ||
|
|
||
| public class VectorIntro { |
There was a problem hiding this comment.
There should be a way to grab sections of this file from draft-hello.md, so that we do not have to manually sync both files.
There was a problem hiding this comment.
A quick couple of searches didn't turn up a simple way to do this such that it works in all contexts. I don't think we should spend too much energy on this, because anyone changing VectorIntro.java should be making sure that intro-tutorial.md is also in sync, considering that it's content is heavily reliant on VectorIntro. I'll add comment to VectorIntro.java making the link more clear.
| - `neighbors.ivec` containing Q K-dimensional integer vectors, one for each query vector, representing the exact K-nearest neighbors for that query among the base vectors. | ||
| The files can be named however you like. | ||
| - Save all three files somewhere in the `fvec` directory in the root of the `jvector` repo (if it doesn't exist, create it). It's recommended to create at least one sub-folder with the name of the dataset and copy or move all three files there. | ||
| - Edit `MultiFileDatasource.java` to configure a new dataset and it's associated files: |
There was a problem hiding this comment.
This section also needs to describe how we are using the environment variable DATASET_HASH and how it needs to be set on the target system or the downloads will fail
There was a problem hiding this comment.
This would be applicable to specific pre-configured datasets (cohere 1M-10M, dpr etc) that are defined in MultiFileDatasource. If I'm not mistaken, accessing these datasets requires not only the dataset hash, but also the credentials of the corresponding S3 bucket. I think it's alright to skip it for now considering that the affected datasets are not public.
Note that this won't prevent the user from externally downloading and using any dataset they have access to.
| int topK = 10; // number of approximate nearest neighbors to fetch | ||
| // You can provide a filter to the query as a bit mask. | ||
| // In this case we want the actual topK neighbors without filtering, | ||
| // so we pass in a virtual bit mask representing all ones. |
There was a problem hiding this comment.
An example of what is meant by filtering when something other than Bits.ALL is used might be useful here.
There was a problem hiding this comment.
I wanted to avoid digressing too much at this point. I've only mentioned filtering since I can't run a search without passing in some Bits instance, and I didn't want to do that with zero commentary. I was planning to elaborate further in other docs, but do you think it's too confusing without it?
Add better documentation for anyone getting started with JVector.