Conversation

shubhamvishu (Contributor)

Description

Addresses this comment on PR #14963 to allow both sequential and random-access consumption of accepted docs in KNN search.

github-actions bot

This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.

@shubhamvishu shubhamvishu changed the title Add AcceptDocs interface for accepted KNN docs Add AcceptDocs abstraction for accepted KNN docs Jul 30, 2025
@jpountz (Contributor) left a comment

Thank you for looking into it, it looks like this refactoring is working well. I left a first round of suggestions.

/**
* Random access to the accepted documents.
*
* @return Bits instance for random access, or null if not available
Contributor

I think we should apply the usual contract that a null Bits instance means that all docs are accepted?

Suggested change
* @return Bits instance for random access, or null if not available
* @return Bits instance for random access, or null if all documents are accepted
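For reference, a minimal sketch of the convention being suggested (not from the PR; acceptBits and docId are illustrative names only):

// Usual Lucene convention: a null Bits instance means every document is accepted.
Bits acceptBits = acceptDocs.bits();
boolean accepted = acceptBits == null || acceptBits.get(docId);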

Contributor

Was this resolved? I do think that "or null if all documents are accepted" is more clear.

fieldEntry.similarityFunction,
getGraphValues(fieldEntry),
getAcceptOrds(acceptDocs, fieldEntry),
getAcceptOrds(acceptDocs != null ? acceptDocs.getBits() : null, fieldEntry),
Contributor

Should we set the contract of this method so that acceptDocs is not allowed to be null? This would help save all these (annoying) null checks, and it wouldn't be a great burden on the caller side?

@jpountz (Contributor) commented Aug 8, 2025

@shubhamvishu I hope you don't mind, I addressed my feedback directly on your branch.

@jpountz (Contributor) commented Aug 8, 2025

We will also need to move the decision between approximate and exact to the codec, but it's probably best done as a follow-up PR? I think that this change is good as-is.

@shubhamvishu (Contributor, Author)

> I hope you don't mind, I addressed my feedback directly on your branch.

@jpountz Not at all, thank you so much for taking care of it! Sorry I couldn’t get to it sooner. I really appreciate the assist.

 

> We will also need to move the decision between approximate and exact to the codec, but it's probably best done as a follow-up PR? I think that this change is good as-is.

I agree completely, that is unrelated to this specific change and is better kept as a separate PR.

@shubhamvishu (Contributor, Author) commented Aug 12, 2025

@jpountz @benwtrent The test failure seems unrelated to this change and is a side effect of the bulk scoring change. The test also fails on main when HNSW is disabled to force exhaustive search, since we now bulk-score 64 docs at a time. I wonder if we still want to respect the visitedLimit with bulk scoring, which it seems we are not doing anymore? If yes, then this looks like an existing bug which we should fix separately. If not, we could just fix the assertion here. Let me know what you think?

@benwtrent (Member)

@shubhamvishu could you open a test failure issue?

@ChrisHegarty what do you think?

@shubhamvishu (Contributor, Author) commented Aug 13, 2025

@benwtrent I opened issue #15057 with the details of the test failure.

@ChrisHegarty (Contributor) left a comment

This is generally very nice. I left a few small comments.

* @return Bits instance for random access, or null if not available
* @throws IOException if an I/O error occurs
*/
public abstract Bits bits() throws IOException;
Contributor

I don't see any requirement for either this bits or iterator to throw IOException. It can be removed unless there is some other usage not currently in this PR that needs it?

Contributor

I guess I was thinking of lazily creating the bitset in this method at first, so it would need the IOException. But actually we always need the cardinality of the bitset to decide between random access and sequential access now, so maybe we could simplify things and make AcceptDocs a record around Bits and a DocIdSetIterator. So no exception, no lazy computation of cost() (re: your other comment).

Contributor

Hmm, I spoke too quickly. A flat scorer should never compute the cost() or use the Bits. So we need to compute both the cost and the Bits lazily.

@jpountz (Contributor) commented Aug 13, 2025

One of Chris' comments helped me realize that we would still load filter matches into a BitSet with a flat scorer, which is undesirable. So I pushed a change that makes the loading into a bitset lazy, so that if a flat KnnVectorsReader only calls AcceptDocs#iterator but neither #cost() nor #bits(), then we never load matches into a bitset. I had to restrict the contract of AcceptDocs so that it's illegal to call #cost() or #bits() after the iterator has advanced.
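For illustration, a rough sketch of the lazy loading described above (not the actual PR code; the class and field names are made up, and it assumes org.apache.lucene.util.BitSet#of):

import java.io.IOException;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.BitSet;
import org.apache.lucene.util.Bits;

// Sketch: the filter iterator is only materialized into a bit set the first time
// bits() or cost() is needed, so a flat reader that only calls iterator() never pays it.
final class LazyAcceptDocsSketch {
  private final DocIdSetIterator iterator;
  private final int maxDoc;
  private BitSet bitSet; // stays null until bits() or cost() is first called

  LazyAcceptDocsSketch(DocIdSetIterator iterator, int maxDoc) {
    this.iterator = iterator;
    this.maxDoc = maxDoc;
  }

  Bits bits() throws IOException {
    if (bitSet == null) {
      bitSet = BitSet.of(iterator, maxDoc); // consumes the original iterator
    }
    return bitSet;
  }

  long cost() throws IOException {
    bits(); // ensure the bit set has been built
    return bitSet.cardinality();
  }
}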

/**
* Random access to the accepted documents.
*
* <p><b>NOTE</b>: This must not be called if the {@link #iterator()} has already been used.
Contributor

what's the contract if bits() is called first, and then iterator()? Is it possible the iterator will have been consumed in the process of creating the Bits?

Contributor

You are right. The original iterator will be consumed to create the bit set and replaced with an iterator over the bit set.

So we should disallow calling bits() or cardinality() after iterator()?

Contributor Author

Also, if we call iterator() first, we directly pass the input iterator as-is in DocIdSetIteratorAcceptDocs, unless bits() is called, which then updates it to a BitSetIterator. How would we do that if we disallow bits() after iterator()?

Contributor

It looks like forbidding #cost() or #bits() to be called after #iterator() works with current impls, so I pushed this change.

If you call #iterator() first, you will get the original iterator; if you call it after one of the two other methods, the original iterator will be loaded into a bit set and #iterator() will return an iterator over the bit set.

@shubhamvishu Sorry, I don't get your point?

Contributor Author

I understand from your main comment above that we don't want to load the matches into the bitset in the case of a flat scorer, but it seems to me that allowing it would make the API contract simpler (at the cost of doing unnecessary work). Is that correct? If yes, could we somehow avoid only the undesirable loading while keeping the contract simple (with no restrictions)?

Basically I was considering that if the user doesn't call bits() or cost() before calling iterator(), we return the same iterator (without loading or creating the BitSet), and the iterator wouldn't filter out deletes, i.e. it wouldn't be a conjunction over live docs, as we previously decided. So I was wondering whether we could allow the caller to access both bits and iterator as needed? Maybe it's not that simple, or the extra cost isn't worth it?

Contributor

> it seems to me that allowing it would make the API contract simpler (at the cost of doing unnecessary work). Is that correct?

This is correct! I agree with your sentiment; I got there because I wanted to avoid some performance traps, but it went too far and the API contract became too complex. I pushed a change so that it's legal to call the methods in any order; it just happens to be a bit more efficient to call bits() before iterator() than iterator() before bits().
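A small usage sketch of the relaxed contract (hedged; AcceptDocs is the abstraction from this PR, the rest is illustrative):

// Both call orders are legal; calling bits() before iterator() is just a bit cheaper
// than the reverse, because the iterator can then be served from the same bit set.
static void consume(AcceptDocs acceptDocs) throws IOException {
  Bits bits = acceptDocs.bits();               // random access; may lazily build a bit set
  DocIdSetIterator it = acceptDocs.iterator(); // sequential access over accepted docs
  for (int doc = it.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = it.nextDoc()) {
    // visit each accepted doc
  }
}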

Contributor Author

This looks good to me!

// Usage of AcceptDocs should be confined to a single thread, so this doesn't need
// synchronization.
if (bitSet == null) {
  if (iterator.docID() != -1) {
Contributor

Should we check this even if bitSet is not null, for consistency with the API contract? As it is, we are exposing some internal details -- it's OK to call bits(), then iterator(), consume some of the iterator, and then call bits() again...

Contributor

It's not checked all the time.

@shubhamvishu (Contributor, Author) commented Aug 22, 2025

@jpountz Thanks for the changes. It looks overall good to me. I just made 2 simple refactoring changes on top:

  1. Moved FilteredDocIdSetIterator creation to a separate function.
  2. Removed BitsAcceptDocs to further simplify things and merged it with BitSetAcceptDocs (no overhead added, since it was not doing any actual work; it was more of a record).

Let me know if it looks good, or feel free to revert any of the changes. I think this is in good shape to be merged soon (if nobody has concerns).

Also ran the luceneutil benchmarks to confirm there is no regression of any sort. Looks all good.

Candidate

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.515       11.386  11.121        0.977  500000   100      50       64        250     4 bits    209.48       2386.85             7         1693.01      1649.857      185.013       HNSW
 0.876       10.755  10.479        0.974  500000   100      50       64        250     7 bits    557.37        897.06             3         1872.24      1832.962      368.118       HNSW
 0.976       13.502  13.153        0.974  500000   100      50       64        250         no    236.32       2115.77             8         1504.68      1464.844     1464.844       HNSW

Baseline

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.514       11.563  11.291        0.976  500000   100      50       64        250     4 bits    214.63       2329.62             7         1693.17      1649.857      185.013       HNSW
 0.878       11.002  10.710        0.973  500000   100      50       64        250     7 bits    557.80        896.37             3         1872.39      1832.962      368.118       HNSW
 0.976       13.821  13.462        0.974  500000   100      50       64        250         no    235.64       2121.88             8         1504.54      1464.844     1464.844       HNSW


@Override
public DocIdSetIterator iterator() {
  return iterator;
Contributor

With my most recent commit, this is now supposed to return a new iterator, so we need to move the iterator instantiation from the ctor to here?

Contributor Author

Done. Thanks!

Contributor

Hmm, we also need to create a new iterator in the case when it's backed by DocIdSetIterator#all?

@shubhamvishu (Contributor, Author) commented Aug 23, 2025

I kept it as-is since we were not creating a new iterator in the case of BitsAcceptDocs in the earlier commit. But it makes sense to return a new iterator regardless.

Contributor Author

Okay, I updated it. We always return a fresh iterator now with both APIs.
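For the DocIdSetIterator#all case mentioned above, returning a fresh iterator per call could look roughly like this (a sketch, not the PR code; maxDoc is an assumed field):

@Override
public DocIdSetIterator iterator() {
  return DocIdSetIterator.all(maxDoc); // fresh, un-advanced iterator on every call
}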

@jpountz (Contributor) left a comment

It looks good! I added some small tests, I'll merge soon.

@jpountz jpountz merged commit 310ef63 into apache:main Aug 24, 2025
8 checks passed
jpountz added a commit that referenced this pull request Aug 24, 2025
Co-authored-by: Adrien Grand <[email protected]>
Co-authored-by: Chris Hegarty <[email protected]>
@kaivalnp (Contributor)

> We will also need to move the decision between approximate and exact to the codec

@jpountz do you mean moving the exactSearch functionality down to the codec?

If so, I think it could be useful in optimizations like #12820 too, where the main overhead is passing / re-creating information from ordinal-space to docid-space (i.e. the "visited" BitSet in our case) -- which could be avoided if exact search was done in ordinal-space.

@shubhamvishu (Contributor, Author)

@kaivalnp Yes, we want to move it down to the Codec. I'll soon post a PR for this change.

@benwtrent (Member)

FYI, for exact search to be fast, it would also be good to have AcceptDocs be able to provide you a "size estimate" that doesn't require evaluating the entire iterator. Right now I think it does that and pretty much defeats the purpose of lazy evaluation of the iterator.
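For illustration (a sketch, not an agreed design): such an estimate could come from the filter iterator's own cost(), which is a cheap upper bound, instead of materializing the iterator into a bit set just to count it.

long estimatedAccepted = filterIterator.cost(); // upper bound, no iteration required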
