abernardi597 commented Dec 4, 2025

Description

I took a stab at bringing the OpenSearch JVector codec into Lucene as a sandbox codec (see issue #14681) to see how a DiskANN-inspired index might compare to the current generation of HNSW.
I made quite a few changes along the way and wanted to cut this PR to share some of those changes and results, and maybe solicit some feedback from interested parties. Most notably, I removed the incremental graph building functionality that is used to speed up merges, though I'd like to add it back and look at the improvements in merge time for JVector indices. I also made a PR for JVector (datastax/jvector#577) to fix a byte-order inconsistency so it can better leverage Lucene's bulk read for floats.

I hooked it up to lucene-util (PR incoming) for comparison, trying to play to the strengths of each codec while maintaining similar levels of parallelism. I ran HNSW with 32 indexing threads and force-merged into 1 segment. For JVector I used 1 indexing thread, backed by a ForkJoinPool with a concurrency of 32 for its SIMD operations and ForkJoinPool.commonPool() for its other parallel operations. I also fixed oversample=1 for both, and used neighborOverflow=2 and alpha=2 for JVector.
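To illustrate the JVector-side threading described above, here is a minimal sketch using only the JDK: a single calling thread hands a batch of work to a dedicated `ForkJoinPool` with a parallelism of 32. The class name `PoolSketch`, the `dot` helper, and the `scoreAll` method are hypothetical stand-ins for illustration. They are not JVector or Lucene APIs, and the actual codec wiring may differ.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ForkJoinPool;
import java.util.stream.IntStream;

public class PoolSketch {
  // Hypothetical stand-in for a vector similarity kernel.
  static double dot(float[] a, float[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  // Scores every vector against the query. The parallel stream is started
  // from inside a task submitted to `pool`, so its work runs on that pool's
  // workers rather than on ForkJoinPool.commonPool().
  static double[] scoreAll(ForkJoinPool pool, float[] query, float[][] vectors) {
    Callable<double[]> task =
        () -> IntStream.range(0, vectors.length)
            .parallel()
            .mapToDouble(i -> dot(query, vectors[i]))
            .toArray();
    return pool.submit(task).join();
  }

  public static void main(String[] args) {
    // One caller (the single indexing thread), 32-way pool for the heavy work.
    ForkJoinPool pool = new ForkJoinPool(32);
    float[][] vectors = {{1f, 0f}, {0f, 1f}, {1f, 1f}};
    double[] scores = scoreAll(pool, new float[] {1f, 2f}, vectors);
    for (double s : scores) {
      System.out.println(s);
    }
    pool.shutdown();
  }
}
```

Keeping the pool separate from `ForkJoinPool.commonPool()` makes the degree of parallelism explicit, which is what allows a like-for-like comparison against 32 HNSW indexing threads.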

These results are from the 768-dim Cohere dataset, using PQ for quantization in JVector and OSQ in Lucene, on an m7g.16xlarge EC2 instance.

| recall | latency(ms) | netCPU | avgCpuCount | nDoc | topK | fanout | maxConn | beamWidth | quantized | visited | index(s) | index_docs/s | force_merge(s) | num_segments | index_size(MB) | vec_disk(MB) | vec_RAM(MB) | indexType | metric |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.965 | 1.408 | 1.399 | 0.994 | 100000 | 100 | 50 | 64 | 250 | no | 4968 | 5.99 | 16700.07 | 10.10 | 1 | 298.17 | 292.969 | 292.969 | HNSW | COSINE |
| 0.939 | 2.186 | 2.155 | 0.986 | 100000 | 100 | 50 | 64 | 250 | no | 3485 | 19.58 | 5107.77 | 0.01 | 1 | 318.80 | 292.969 | 292.969 | JVECTOR | COSINE |
| 0.963 | 1.409 | 1.401 | 0.994 | 100000 | 100 | 50 | 64 | 250 | 8 bits | 5028 | 8.75 | 11431.18 | 12.95 | 1 | 372.84 | 367.737 | 74.768 | HNSW | COSINE |
| 0.939 | 9.524 | 9.516 | 0.999 | 100000 | 100 | 50 | 64 | 250 | 8 bits | 3525 | 886.28 | 112.83 | 0.01 | 1 | 392.79 | 367.737 | 74.768 | JVECTOR | COSINE |
| 0.899 | 0.967 | 0.959 | 0.992 | 100000 | 100 | 50 | 64 | 250 | 4 bits | 5076 | 8.84 | 11314.78 | 9.07 | 1 | 335.80 | 331.116 | 38.147 | HNSW | COSINE |
| 0.937 | 3.469 | 3.457 | 0.997 | 100000 | 100 | 50 | 64 | 250 | 4 bits | 3437 | 148.70 | 672.51 | 0.01 | 1 | 356.17 | 331.116 | 38.147 | JVECTOR | COSINE |
| 0.669 | 0.681 | 0.673 | 0.988 | 100000 | 100 | 50 | 64 | 250 | 1 bits | 5895 | 8.04 | 12439.36 | 8.84 | 1 | 308.42 | 303.459 | 10.490 | HNSW | COSINE |
| 0.730 | 1.056 | 1.044 | 0.989 | 100000 | 100 | 50 | 64 | 250 | 1 bits | 2672 | 51.39 | 1945.90 | 0.01 | 1 | 328.70 | 303.459 | 10.490 | JVECTOR | COSINE |
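As a reading aid for the table, two of its columns can be reproduced with simple arithmetic. The sketch below assumes 4-byte floats and that "MB" means 2^20 bytes, which is an inference from the numbers rather than something stated in the benchmark output. Under those assumptions the unquantized `vec_RAM(MB)` value matches the raw float32 footprint of 100000 vectors of 768 dimensions, and `index_docs/s` matches `nDoc` divided by `index(s)`. The class and method names are hypothetical.

```java
public class FootprintSketch {
  // Raw float32 footprint in units of 2^20 bytes (assumed meaning of "MB").
  static double rawFloatMiB(long nDoc, int dims) {
    return nDoc * (double) dims * Float.BYTES / (1024.0 * 1024.0);
  }

  // Indexing throughput as documents per second.
  static double docsPerSecond(long nDoc, double indexSeconds) {
    return nDoc / indexSeconds;
  }

  public static void main(String[] args) {
    // Compare with vec_RAM(MB) in the unquantized rows.
    System.out.printf("%.3f%n", rawFloatMiB(100_000, 768));
    // Compare with index_docs/s in the 8-bit JVECTOR row (index(s) = 886.28).
    System.out.printf("%.2f%n", docsPerSecond(100_000, 886.28));
  }
}
```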

This PR is not really intended to be merged, in light of some of the feedback on the previous PR (#14892) that suggests Lucene should try to incorporate some of the learnings rather than add yet another KNN engine.
