Conversation

shubhamvishu (Contributor)

Description

Addresses this comment on PR #14963 to allow both sequential and random-access consumption of accepted docs in KNN search.

github-actions bot

This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.

@shubhamvishu shubhamvishu changed the title Add AcceptDocs interface for accepted KNN docs Add AcceptDocs abstraction for accepted KNN docs Jul 30, 2025
@jpountz (Contributor) left a comment

Thank you for looking into it, it looks like this refactoring is working well. I left a first round of suggestions.

/**
* Random access to the accepted documents.
*
* @return Bits instance for random access, or null if not available
Contributor

I think we should apply the usual contract that a null Bits instance means that all docs are accepted?

Suggested change
* @return Bits instance for random access, or null if not available
* @return Bits instance for random access, or null if all documents are accepted
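For reference, a minimal sketch of the convention being suggested (not from the PR; acceptBits and docId are illustrative names only):

// Usual Lucene convention: a null Bits instance means every document is accepted.
Bits acceptBits = acceptDocs.bits();
boolean accepted = acceptBits == null || acceptBits.get(docId);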

Contributor

Was this resolved? I do think that "or null if all documents are accepted" is more clear.

fieldEntry.similarityFunction,
getGraphValues(fieldEntry),
getAcceptOrds(acceptDocs, fieldEntry),
getAcceptOrds(acceptDocs != null ? acceptDocs.getBits() : null, fieldEntry),
Contributor

Should we set the contract of this method so that acceptDocs is not allowed to be null? This would help save all these (annoying) null checks, and it wouldn't be a great burden on the caller side?

@jpountz (Contributor) commented Aug 8, 2025

@shubhamvishu I hope you don't mind, I addressed my feedback directly on your branch.

@jpountz (Contributor) commented Aug 8, 2025

We will also need to move the decision between approximate and exact to the codec, but it's probably best done as a follow-up PR? I think that this change is good as-is.

@shubhamvishu (Contributor, Author)

> I hope you don't mind, I addressed my feedback directly on your branch.

@jpountz Not at all, thank you so much for taking care of it! Sorry I couldn’t get to it sooner. I really appreciate the assist.

 

> We will also need to move the decision between approximate and exact to the codec, but it's probably best done as a follow-up PR? I think that this change is good as-is.

I agree completely, that is unrelated to this specific change and is better kept as a separate PR.

@shubhamvishu (Contributor, Author) commented Aug 12, 2025

@jpountz @benwtrent The test failure seems unrelated to this change and is a side effect of the bulk scoring change. The test also fails on main when HNSW is disabled to force exhaustive search, since we now bulk-score 64 docs at a time. I wonder if we still want to respect the visitedLimit with bulk scoring, which it seems we are not doing anymore? If yes, then this looks like an existing bug which we should fix separately. If not, we could just fix the assertion here. Let me know what you think?

@benwtrent (Member)

@shubhamvishu could you open a test failure issue?

@ChrisHegarty what do you think?

@shubhamvishu (Contributor, Author) commented Aug 13, 2025

@benwtrent I opened issue #15057 with the details of the test failure.

@ChrisHegarty (Contributor) left a comment

This is generally very nice. I left a few small comments.

* @return Bits instance for random access, or null if not available
* @throws IOException if an I/O error occurs
*/
public abstract Bits bits() throws IOException;
Contributor

I don't see any requirement for either this bits or iterator to throw IOException. It can be removed unless there is some other usage not currently in this PR that needs it?

Contributor

I guess I was thinking of lazily creating the bitset in this method at first, so it would need the IOException. But actually we always need the cardinality of the bitset to decide between random access and sequential access now, so maybe we could simplify things and make AcceptDocs a record around Bits and a DocIdSetIterator. So no exception, no lazy computation of cost() (re: your other comment).

Contributor

Hmm, I spoke too quickly. A flat scorer should never compute the cost() or use the Bits. So we need to compute both the cost and the Bits lazily.

@jpountz (Contributor) commented Aug 13, 2025

One of Chris' comments helped me realize that we would still load filter matches into a BitSet with a flat scorer, which is undesirable. So I pushed a change that makes the loading into a bitset lazy, so that if a flat KnnVectorsReader only calls AcceptDocs#iterator but neither #cost() nor #bits(), then we never load matches into a bitset. I had to restrict the contract of AcceptDocs so that it's illegal to call #cost() or #bits() after the iterator has advanced.
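For illustration, a rough sketch of the lazy loading described above (not the actual PR code; the class and field names are made up, and it assumes org.apache.lucene.util.BitSet#of):

import java.io.IOException;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.BitSet;
import org.apache.lucene.util.Bits;

// Sketch: the filter iterator is only materialized into a bit set the first time
// bits() or cost() is needed, so a flat reader that only calls iterator() never pays it.
final class LazyAcceptDocsSketch {
  private final DocIdSetIterator iterator;
  private final int maxDoc;
  private BitSet bitSet; // stays null until bits() or cost() is first called

  LazyAcceptDocsSketch(DocIdSetIterator iterator, int maxDoc) {
    this.iterator = iterator;
    this.maxDoc = maxDoc;
  }

  Bits bits() throws IOException {
    if (bitSet == null) {
      bitSet = BitSet.of(iterator, maxDoc); // consumes the original iterator
    }
    return bitSet;
  }

  long cost() throws IOException {
    bits(); // ensure the bit set has been built
    return bitSet.cardinality();
  }
}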

/**
* Random access to the accepted documents.
*
* <p><b>NOTE</b>: This must not be called if the {@link #iterator()} has already been used.
Contributor

what's the contract if bits() is called first, and then iterator()? Is it possible the iterator will have been consumed in the process of creating the Bits?

Contributor

You are right. The original iterator will be consumed to create the bit set and replaced with an iterator over the bit set.

So we should disallow calling bits() or cardinality() after iterator()?

Contributor Author

Also, if we call iterator() first, we directly pass the input iterator as-is in DocIdSetIteratorAcceptDocs, unless bits() is called, which then updates it to a BitSetIterator. How would we do that if we disallow bits() after iterator()?

Contributor

It looks like forbidding #cost() or #bits() to be called after #iterator() works with current impls, so I pushed this change.

If you call #iterator() first, you will get the original iterator; if you call it after one of the two other methods, the original iterator will be loaded into a bit set and #iterator() will return an iterator over the bit set.

@shubhamvishu Sorry, I don't get your point?

Contributor Author

I understand from your main comment above that we don't want to load the matches into the bitset in the case of a flat scorer, but it seems to me that allowing it would make the API contract simpler (at the cost of doing unnecessary work). Is that correct? If yes, could we somehow avoid only the undesirable loading while keeping the contract simple (with no restrictions)?

Basically I was considering that if the user doesn't call bits() or cost() before calling iterator(), we return the same iterator (without loading or creating the BitSet), and the iterator wouldn't filter out deletes, i.e. it wouldn't be a conjunction over live docs, as we previously decided. So I was wondering whether we could allow the caller to access both bits and iterator as needed? Maybe it's not that simple, or the extra cost isn't worth it?

Contributor

> it seems to me that allowing it would make the API contract simpler (at the cost of doing unnecessary work). Is that correct?

This is correct! I agree with your sentiment; I got there because I wanted to avoid some performance traps, but it went too far and the API contract became too complex. I pushed a change so that it's legal to call the methods in any order; it just happens to be a bit more efficient to call bits() before iterator() than iterator() before bits().
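A small usage sketch of the relaxed contract (hedged; AcceptDocs is the abstraction from this PR, the rest is illustrative):

// Both call orders are legal; calling bits() before iterator() is just a bit cheaper
// than the reverse, because the iterator can then be served from the same bit set.
static void consume(AcceptDocs acceptDocs) throws IOException {
  Bits bits = acceptDocs.bits();               // random access; may lazily build a bit set
  DocIdSetIterator it = acceptDocs.iterator(); // sequential access over accepted docs
  for (int doc = it.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = it.nextDoc()) {
    // visit each accepted doc
  }
}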

Contributor Author

This looks good to me!

// Usage of AcceptDocs should be confined to a single thread, so this doesn't need
// synchronization.
if (bitSet == null) {
  if (iterator.docID() != -1) {
Contributor

Should we check this even if bitSet is not null, for consistency with the API contract? As it is, we are exposing some internal details -- it's OK to call bits(), then iterator(), consume some of the iterator, and then call bits() again...

Contributor

It's not checked all the time.

@shubhamvishu (Contributor, Author) commented Aug 22, 2025

@jpountz Thanks for the changes. It looks overall good to me. I just made 2 simple refactoring changes on top:

  1. Moved FilteredDocIdSetIterator creation to a separate function.
  2. Removed BitsAcceptDocs to further simplify things and merged it with BitSetAcceptDocs (no overhead added, since it was not doing any actual work; it was more of a record).

Let me know if it looks good, or feel free to revert any of the changes. I think this is in good shape to be merged soon (if nobody has concerns).

Also ran the luceneutil benchmarks to confirm there is no regression of any sort. Looks all good.

Candidate

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.515       11.386  11.121        0.977  500000   100      50       64        250     4 bits    209.48       2386.85             7         1693.01      1649.857      185.013       HNSW
 0.876       10.755  10.479        0.974  500000   100      50       64        250     7 bits    557.37        897.06             3         1872.24      1832.962      368.118       HNSW
 0.976       13.502  13.153        0.974  500000   100      50       64        250         no    236.32       2115.77             8         1504.68      1464.844     1464.844       HNSW

Baseline

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.514       11.563  11.291        0.976  500000   100      50       64        250     4 bits    214.63       2329.62             7         1693.17      1649.857      185.013       HNSW
 0.878       11.002  10.710        0.973  500000   100      50       64        250     7 bits    557.80        896.37             3         1872.39      1832.962      368.118       HNSW
 0.976       13.821  13.462        0.974  500000   100      50       64        250         no    235.64       2121.88             8         1504.54      1464.844     1464.844       HNSW


@Override
public DocIdSetIterator iterator() {
  return iterator;
Contributor

With my most recent commit, this is now supposed to return a new iterator, so we need to move the iterator instantiation from the ctor to here?

Contributor Author

Done. Thanks!

Contributor

Hmm, we also need to create a new iterator in the case when it's backed by DocIdSetIterator#all?

@shubhamvishu (Contributor, Author) commented Aug 23, 2025

I kept it as-is since we were not creating a new iterator in the case of BitsAcceptDocs in the earlier commit. But it makes sense to return a new iterator regardless.

Contributor Author

Okay, I updated it. We always return a fresh iterator now with both APIs.
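For the DocIdSetIterator#all case mentioned above, returning a fresh iterator per call could look roughly like this (a sketch, not the PR code; maxDoc is an assumed field):

@Override
public DocIdSetIterator iterator() {
  return DocIdSetIterator.all(maxDoc); // fresh, un-advanced iterator on every call
}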

@jpountz (Contributor) left a comment

It looks good! I added some small tests, I'll merge soon.

@jpountz jpountz merged commit 310ef63 into apache:main Aug 24, 2025
8 checks passed
jpountz added a commit that referenced this pull request Aug 24, 2025
Co-authored-by: Adrien Grand <[email protected]>
Co-authored-by: Chris Hegarty <[email protected]>
@kaivalnp (Contributor)

> We will also need to move the decision between approximate and exact to the codec

@jpountz do you mean moving the exactSearch functionality down to the codec?

If so, I think it could be useful in optimizations like #12820 too, where the main overhead is passing / re-creating information from ordinal-space to docid-space (i.e. the "visited" BitSet in our case) -- which could be avoided if exact search was done in ordinal-space.

@shubhamvishu (Contributor, Author)

@kaivalnp Yes, we want to move it down to the Codec. I'll soon post a PR for this change.

@benwtrent (Member)

FYI, for exact search to be fast, it would also be good to have AcceptDocs be able to provide you a "size estimate" that doesn't require evaluating the entire iterator. Right now I think it does that and pretty much defeats the purpose of lazy evaluation of the iterator.
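For illustration (a sketch, not an agreed design): such an estimate could come from the filter iterator's own cost(), which is a cheap upper bound, instead of materializing the iterator into a bit set just to count it.

long estimatedAccepted = filterIterator.cost(); // upper bound, no iteration required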
