JVector comes with a built-in benchmarking system in jvector-examples/.../BenchYAML.java.
To run a benchmark
- Decide which dataset(s) you want to benchmark. A dataset consists of
- The vectors to be indexed, usually called the "base" or "target" vectors
- The query vectors
- The "ground truth" results that are used to compute accuracy metrics
- The similarity metric used compute the ground truth (dot product, cosine similarity or L2 distance)
- Configure the parameters combinations for which you want to run the benchmark. This includes index construction parameters, quantization parameters and search parameters.
JVector supports datasets in the fvecs/ivecs format. These consist of three files, for example base.fvecs, queries.fvecs and neighbors.ivecs containing the base vectors, query vectors, and ground truth. (fvecs and ivecs file formats are described here)
The general procedure for running benchmarks is mentioned below. The following sections describe the process in more detail.
- Specify the dataset names to benchmark in
datasets.yml. - Certain datasets will be downloaded automatically. If using a different dataset, make sure the dataset files are downloaded and made available (refer the section on Custom datasets).
- Adjust the benchmark parameters in
default.yml. This will affect the parameters for all datasets benchmarked. You can specify custom parameters for a specific dataset by creating a file called<your-dataset-name>.ymlin theindex-parameterssubfolder. - Decide on the kind of measurements and logging you want and configure them in
run-config.yml.
You can run the configured benchmark with maven:
mvn clean compile exec:exec@bench -pl jvector-examples -amThe datasets you want to benchmark should be specified in jvector-examples/yaml-configs/datasets.yml. You'll notice this file already contains some entries; these are datasets that bench can automatically download and test with minimal additional configuration. Running bench without arguments and without changing this file will cause ALL the datasets to be benchmarked one by one (this is probably not what you want).
To benchmark a single dataset, comment out the entries corresponding to all other datasets. (Or provide command line arguments as described in Running bench from the command line)
Datasets are grouped into categories. The categories can be arbitrarily chosen for convenience and are not currently considered by the benchmarking system.
Dataset similarity functions are configured in jvector-examples/yaml-configs/dataset-metadata.yml.
Example datasets.yml:
category0:
- my-dataset-a
- my-dataset-b
some-other-category:
- another-dataset-a
- another-dataset-bjvector-examples/yaml-configs/index-parameters/default.yml specifies the default index construction and search parameters to be used by bench for all datasets.
You can specify a custom set of a parameters for any given dataset by creating a file called <dataset-name>.yml, with <dataset-name> replaced by the actual name of the dataset. This is the same as the identifier used in datasets.yml. The format of this file is exactly the same as default.yml.
Refer to default.yml for a list of all options.
Most parameters can be specified as an array. For these parameters, a separate benchmark is run for each value of the parameter. If multiple parameters are specified as arrays, a benchmark is run for each combination (i.e. taking the Cartesian product). For example:
construction:
M: [32, 64]
ef: [100, 200]will build and benchmark four graphs, one for each combination of M and ef in {(32, 100), (64, 100), (32, 200), (64, 200)}. This is particularly useful when running a Grid search to identify the best performing parameters.
This file contains configurations for
- Specifying the measurements you want to report, like QPS, latency and recall
- Specifying where to output these measurements, i.e. to the console, or to a file, or both.
The configurations in this file are "run-level", meaning that they are shared across all the datasets being benchmarked.
See run-config.yml for a full list of all options.
Once configured to your liking, you can run the benchmark through maven:
mvn compile exec:exec@bench -pl jvector-examples -amTo benchmark a subset of the datasets in datasets.yml, you can provide a space-separated list of regexes as arguments.
# matches `glove-25-angular`, `glove-50-angular`, `nytimes-256-angular` etc
mvn compile exec:exec@bench -pl jvector-examples -am -DbenchArgs="glove nytimes"Datasets are configured via YAML catalog files under jvector-examples/yaml-configs/dataset-catalogs/. The loader recursively discovers all .yaml/.yml files in that directory tree. See jvector-examples/yaml-configs/dataset-catalogs/local-catalog.yaml for the full format reference.
To add a custom fvecs/ivecs dataset:
- Add a
.yamlfile to the YAML catalog directory, mapping your dataset name to its files:_defaults: cache_dir: ${DATASET_CACHE_DIR:-dataset_cache} my-dataset: base: my_base_vectors.fvecs query: my_query_vectors.fvecs gt: my_ground_truth.ivecs
- Place your fvecs/ivecs files at the paths you specified in the YAML (or specify a
cache_dir/base_urlto fetch them from a remote source). - Add the dataset's similarity function to
jvector-examples/yaml-configs/dataset-metadata.yml:my-dataset: similarity_function: COSINE load_behavior: NO_SCRUB
- Add the dataset name to
jvector-examples/yaml-configs/datasets.ymlso BenchYAML can find it:custom: - my-dataset
For remote datasets, use base_url to specify where files should be downloaded from. The ${VAR} and ${VAR:-default} syntax is supported for environment variable expansion. See the example config for details.