[WIP] Revisiting JVector codec #15472
Draft: abernardi597 wants to merge 77 commits into apache:main from abernardi597:jvector
Description
I took a stab at bringing the OpenSearch JVector codec into Lucene as a codec in `sandbox` (see issue #14681) to see how a DiskANN-inspired index might compare to the current generation of HNSW. I made quite a few changes along the way and wanted to cut this PR to share some of those changes and results and to solicit feedback from interested parties. Most notably, I removed the incremental graph building functionality that is used to speed up merges, though I'd like to add it back and look at the improvements in merge time for JVector indices. I also made a PR for JVector (datastax/jvector#577) to fix a byte-order inconsistency to better leverage Lucene's bulk-read for floats.
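To illustrate why a byte-order inconsistency matters for bulk float reads, here is a minimal, generic sketch. It is not the code from datastax/jvector#577, and the class and method names (`ByteOrderSketch`, `encode`, `decodeBulk`) are hypothetical. It only shows that a bulk read succeeds when writer and reader agree on byte order, and yields garbage when they do not, which is the kind of mismatch that would force a slower per-value conversion.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

public class ByteOrderSketch {
    // Encode floats into bytes using the given byte order (hypothetical helper).
    static byte[] encode(float[] values, ByteOrder order) {
        ByteBuffer buf = ByteBuffer.allocate(values.length * Float.BYTES).order(order);
        for (float v : values) {
            buf.putFloat(v);
        }
        return buf.array();
    }

    // Decode all floats in one bulk call by viewing the bytes as a FloatBuffer.
    static float[] decodeBulk(byte[] bytes, ByteOrder order) {
        float[] out = new float[bytes.length / Float.BYTES];
        ByteBuffer.wrap(bytes).order(order).asFloatBuffer().get(out);
        return out;
    }

    public static void main(String[] args) {
        float[] vector = {1.0f, -2.5f, 3.25f};
        byte[] written = encode(vector, ByteOrder.BIG_ENDIAN);

        // Matching byte order: the bulk read round-trips the vector.
        System.out.println(Arrays.toString(decodeBulk(written, ByteOrder.BIG_ENDIAN)));

        // Mismatched byte order: the bulk read returns different values.
        System.out.println(Arrays.toString(decodeBulk(written, ByteOrder.LITTLE_ENDIAN)));
    }
}
```

When both sides use the same byte order, the reader can copy a whole vector in one call instead of swapping bytes value by value.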
I hooked it up to `lucene-util` (PR incoming) for comparison, trying to play to the strengths of each codec while maintaining similar levels of parallelism. I ran HNSW using 32x indexing threads and force-merging into 1 segment, while using 1x indexing thread for JVector backed by a 32x concurrency `ForkJoinPool` for its SIMD operations and `ForkJoinPool.commonPool()` for its other parallel operations. I also fixed `oversample=1` for both and used `neighborOverflow=2` and `alpha=2` for JVector.

These results are from the 768-dim Cohere dataset, using PQ for quantization in JVector and OSQ in Lucene, on an `m7g.16xlarge` EC2 instance.

This PR is not really intended to be merged, in light of some of the feedback on the previous PR (#14892) that suggests Lucene should try to incorporate some of the learnings rather than add yet another KNN engine.
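For readers unfamiliar with the two-pool setup mentioned in the benchmark description, the following is a minimal sketch of running work on a dedicated `ForkJoinPool` with a parallelism of 32 while leaving `ForkJoinPool.commonPool()` for other tasks. How JVector actually wires its pools is not shown in this PR description; `PoolSketch`, `dedicatedPool`, `dot`, and `scoreAll` are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.ForkJoinTask;

public class PoolSketch {
    // Create a dedicated pool with a fixed parallelism (32 in the benchmark text).
    static ForkJoinPool dedicatedPool(int parallelism) {
        return new ForkJoinPool(parallelism);
    }

    // Plain scalar dot product, standing in for a SIMD-heavy kernel.
    static float dot(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    // Score every candidate against the query, one task per candidate, on the given pool.
    static float[] scoreAll(ForkJoinPool pool, float[] query, float[][] candidates) {
        List<ForkJoinTask<Float>> tasks = new ArrayList<>();
        for (float[] candidate : candidates) {
            Callable<Float> task = () -> dot(query, candidate);
            tasks.add(pool.submit(task));
        }
        float[] scores = new float[candidates.length];
        for (int i = 0; i < scores.length; i++) {
            scores[i] = tasks.get(i).join();
        }
        return scores;
    }

    public static void main(String[] args) {
        ForkJoinPool simdPool = dedicatedPool(32);          // assumed role: vector kernels
        ForkJoinPool otherPool = ForkJoinPool.commonPool(); // assumed role: other parallel work
        float[] query = {1f, 2f};
        float[][] candidates = {{3f, 4f}, {0f, 1f}};
        float[] scores = scoreAll(simdPool, query, candidates);
        System.out.println(scores[0] + " " + scores[1]);
        System.out.println("common pool parallelism: " + otherPool.getParallelism());
        simdPool.shutdown();
    }
}
```

Keeping the compute-heavy work on its own pool makes the degree of parallelism explicit and comparable across runs, whereas the common pool's parallelism depends on the machine and JVM settings.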